Abstract

Analyzing the similarities between genomic sequences is one 
of the principal methods used to investigate the evolutionary 
relationships between species. For relatively short 
sequences, such as nucleotide sequences of genes or amino 
acid sequences of proteins, alignment is widely used to 
evaluate the sequence similarity. However, alignment is
not practical for comparing very long sequences, such as
genome sequences, because it is time-consuming. In this
article, we propose a new method for graphical
representation of DNA sequences, an approach that belongs
to one of the major categories of alignment-free sequence
comparison. We
introduce a practical method for the numerical conversion of 
DNA sequences, in which we assign three-dimensional 
vectors in a symmetrical manner to the bases of genome 
sequences. We confirm the usefulness of our method in 
terms of the intuitive assessment of sequence similarities. 
Introduction

In comparative genomics, comparing genome 
sequences is one of the main tasks because sequence 
similarities strongly reflect the evolutionary 
relationships between the corresponding species. In 
addition, with the introduction of next-generation 
sequencing technologies, the demand for rapid 
comparisons of massive amounts of long sequences 
has increased in recent years. 

Sequence alignment (Smith and Waterman 1981; 
Needleman and Wunsch 1970) is generally used to 
compare relatively short sequences, such as nucleotide 
sequences of genes or amino acid sequences of 
proteins. The time complexity of sequence
alignment is $O(N^2)$ for sequences of length N, which
means that alignment becomes very time-consuming
when N is extremely large (for instance, in the
alignment of whole genome sequences). Therefore,
along with improvements in alignment-based
methods, alignment-free methods are being actively
studied for comparing such long sequences.

Graphical representation of DNA sequences is one of
the alignment-free methods; it allows visual
inspection of DNA sequences and makes it possible to
compare DNA sequences instantly. Various schemes
for the graphical representation have been proposed 
by several authors based on the projection of DNA 
sequences on 2D (Qi, Li, and Qi 2011; Huang et al. 
2011; Yu et al. 2010; Zhang 2009; Randic 2008; Qi and 
Qi 2007; Bielińska-Wąż et al. 2007a; Bielińska-Wąż et
al. 2007b; Zhang and Chen 2006; Liu et al. 2006; Song 
and Tang 2005; Liao, Tan, and Ding 2005; Liao and 
Wang 2004d; Randic et al. 2003a; Randic et al. 2003b; 
Wu et al. 2003; Liu et al. 2002; Randic and Vracko 2000; 
Nandy 1994; Jeffrey 1990; Gates 1985), 3D (Xie and Mo 
2011; Yu and Sun 2010; Yu, Sun, and Wang 2009; Cao, 
Liao, and Li 2008; Qi, Wen, and Qi 2007; Qi and Fan 
2007; Liao and Ding 2006; Yao, Nan, and Wang 2005; 
Liao and Wang 2004a; Liao and Wang 2004b; Balaban, 
Plavsic, and Randic 2003; Zhang, Zhang, and Ou 2003; 
Randic et al. 2000; Hamori 1985; Hamori and Ruskin 
1983), or higher dimensional spaces (Liao et al. 2007; 
Chi and Ding 2005; Liao and Wang 2004; Randic and 
Balaban 2003). The basic procedure is common in 
almost all of the above-mentioned schemes: numerical 
conversion of bases of DNA sequences, consecutive 
mapping of the converted bases on a certain 
dimensional space to draw a graph, and estimation of 
the similarity between the graphs. In this study, we
propose a novel method for graphical representation
of DNA sequences in 3D space, in which we adopt
symmetrical vector assignments and introduce
weighting in the numerical conversion.



FIGURE 1 VECTORS ASSIGNED TO FOUR TYPES OF BASES.

Method

Symmetrical Vector Assignment

We assign distinct vectors of a certain dimension to
each of the four types of bases, A, T, G, and C, for
numerical conversion. By connecting the vectors
corresponding to the bases extracted one by one from
the head of the target genome sequence, we obtain
its graphical representation. If we map
genome sequences on a 2D space, the interrelationship
between the resultant graphs may change according to
the arrangement of the vectors due to their
asymmetrical nature (i.e. the distances between the
end points of the pairs formed from the four vectors
cannot all be equal). In this study, therefore, we map
genome sequences on a 3D space using vectors
represented by the vertices of a regular tetrahedron
with edges of length 1 (FIG. 1). Here, we emphasize
that all arrangements of the four bases on the vertices
can be transformed into one another by rotation and/or
space inversion. These transformations do not affect
the distances between the resultant graphs, because the
distance we define is invariant under rotation and
space inversion (see Distance Measure Between
Sequences); therefore, only the configuration shown in
FIG. 1 needs to be considered.

Weighting Factors

In order to evaluate effectively the information that
each base in a genome sequence conveys, we assigned
weighting factors to the vectors according to the
appearance probabilities of the corresponding bases.
We used the self-information of the appearance of each
kind of base as the weighting factor. Let P be the
probability that a certain event occurs; the
self-information I of the occurrence of the event is
then expressed by

$I = -\log P.$ (1)

Here, we take the conditional probability of the
occurrence of each base as P. A conditional probability
is the probability that an event occurs given that
another event has already occurred. For example, the
conditional probability P(A | GC) is the
probability that base A appears after the pair of bases
GC, which is computed by

$P(A \mid GC) = \frac{\#GCA}{\#GCA + \#GCT + \#GCG + \#GCC},$ (2)

where #W (W = "GCA", "GCT", ...) represents the
number of occurrences of string W.

For the string length used to calculate the conditional
probability, we took into account that amino
acids are encoded by codons (i.e. triplets of bases) in
genome sequences, and we used length three to capture
some information about the coding regions of the genome
sequences, although the results obtained with
different lengths did not differ greatly
(data not shown).

We computed the conditional probabilities using all
the genome sequences analyzed in this study. TABLE
1 shows the weighting factors computed according to
the above-mentioned procedure based on
trinucleotides.
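
The computation of Eqs. (1) and (2) can be sketched as follows. This is a minimal illustration and not the code used in the study: the function name trinucleotide_weights, the use of overlapping windows, and the toy input are assumptions made here. The natural logarithm is also an assumption, since Eq. (1) does not state the base; it is at least consistent with TABLE 1, where exp(-w) summed over the four entries of a row comes to about 1 (for example, in rows AA and GC).

```python
import math
from collections import Counter

BASES = "AGCT"

def trinucleotide_weights(sequences):
    """Return {(preceding pair, base): -ln P(base | pair)} from pooled counts."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # Overlapping windows of length three (an assumption of this sketch).
        for i in range(len(seq) - 2):
            tri = seq[i:i + 3]
            if all(ch in BASES for ch in tri):  # skip ambiguous symbols such as N
                counts[tri] += 1
    weights = {}
    for first in BASES:
        for second in BASES:
            pair = first + second
            total = sum(counts[pair + b] for b in BASES)  # denominator of Eq. (2)
            for b in BASES:
                n = counts[pair + b]
                if n > 0:  # self-information is undefined for unseen strings
                    weights[(pair, b)] = -math.log(n / total)  # Eq. (1)
    return weights

# Hypothetical toy input; the study pooled all analyzed genome sequences.
w = trinucleotide_weights(["GCAGCTGCAGCC"])
print(round(w[("GC", "A")], 3), round(w[("GC", "T")], 3))
```

In the toy input, GC is followed by A twice and by T and C once each, so P(A | GC) = 0.5 and P(T | GC) = 0.25.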


TABLE 1 WEIGHTING FACTORS FOR BASES

Preceding    Base
sequence     A      G      C      T
AA           1.13   2.05   1.32   1.27
AG           1.44   1.53   1.10   1.54
AC           1.15   2.32   1.25   1.21
AT           1.17   2.00   1.35   1.22
GA           1.08   1.61   1.45   1.49
GG           1.08   1.72   1.31   1.55
GC           1.18   2.76   1.04   1.29
GT           0.93   1.91   1.54   1.41
CA           1.16   1.99   1.34   1.25
CG           1.25   1.67   1.28   1.40
CC           1.24   2.51   1.19   1.13
CT           0.95   2.11   1.42   1.39
TA           1.18   1.77   1.42   1.27
TG           0.99   1.61   1.48   1.61
TC           1.07   2.36   1.26   1.28
TT           1.06   2.10   1.32   1.33


Graphical Representation 

Graphical representation of a genome sequence is
performed by sequentially connecting the weighted
vectors corresponding to the bases in the genome
sequence. In drawing a graph, the start point is set to
the origin of the 3D space. Here, we show a
simple example of the procedure. Let "GATCA" be a
nucleotide sequence. We begin the graphical
representation with the third base 'T' because we need
two preceding bases to assign the weighting factor.
The vector corresponding to 'T' is
$(-1/\sqrt{3}, 1/\sqrt{3}, -1/\sqrt{3})$ (FIG. 1) and the weighting
factor for 'T' of "GAT" is 1.49 (TABLE 1). The
coordinate value of 'T' is then calculated to be
$1.49\,(-1/\sqrt{3}, 1/\sqrt{3}, -1/\sqrt{3}) = (-0.86, 0.86, -0.86)$.
Similarly, the weighted vector for the next base 'C' is
calculated to be $1.35\,(1/\sqrt{3}, -1/\sqrt{3}, -1/\sqrt{3}) =
(0.78, -0.78, -0.78)$ and is added to the above
coordinate value, giving $(-0.08, 0.08, -1.64)$. This
procedure is continued to the end of the sequence.
Thus, the graphical representation of "GATCA" is
completed (FIG. 2).
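
The worked example can be reproduced with a short sketch. The vectors for 'T' and 'C' and the three weighting factors are taken from the text and TABLE 1; the vectors for 'A' and 'G' are assumptions, chosen here as the two remaining vertices of the tetrahedron because FIG. 1 is not reproduced in the text. The function name graph_points is hypothetical.

```python
import math

S = 1 / math.sqrt(3)

VECTORS = {
    "T": (-S, S, -S),   # from the worked example
    "C": (S, -S, -S),   # from the worked example
    "A": (S, S, S),     # assumption: one of the two remaining vertices
    "G": (-S, -S, S),   # assumption: the other remaining vertex
}

# Only the TABLE 1 entries needed for "GATCA": (preceding pair, base) -> weight
WEIGHTS = {("GA", "T"): 1.49, ("AT", "C"): 1.35, ("TC", "A"): 1.07}

def graph_points(seq, weights=WEIGHTS, vectors=VECTORS):
    """Return the points of the graph, starting at the origin."""
    x = y = z = 0.0
    points = [(x, y, z)]
    # The first two bases only provide context for the third base.
    for i in range(2, len(seq)):
        w = weights[(seq[i - 2:i], seq[i])]
        vx, vy, vz = vectors[seq[i]]
        x, y, z = x + w * vx, y + w * vy, z + w * vz
        points.append((x, y, z))
    return points

points = graph_points("GATCA")
for p in points:
    print(tuple(round(c, 2) for c in p))
```

The second and third points agree with the values (-0.86, 0.86, -0.86) and (-0.08, 0.08, -1.64) given above; the last point depends on the assumed vector for 'A'.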

When the appearances of the resultant graphs are 
similar for some genome sequences, the corresponding 
species can be considered to be closely related to each 
other, and when completely dissimilar, the 
corresponding species can be considered to be 
distantly related to each other. Note again that, due to 
the symmetric properties of the vector assignment (see 
Symmetrical vector assignment), the basic features of 
the resultant graphs are independent of the 
arrangement of the vectors, although the appearances 
of the graphs may change according to the 
arrangement. 
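
The independence from the arrangement can be checked numerically. The sketch below is an illustration under stated assumptions: it uses unweighted vectors for brevity (the weighting factors depend only on the sequence context, not on the arrangement, so they do not change the argument), a simple point-by-point distance between two graphs of equal length, and two hypothetical toy sequences. It evaluates the distance for all 24 assignments of the four bases to the vertices.

```python
import math
from itertools import permutations

S = 1 / math.sqrt(3)
VERTICES = [(S, S, S), (-S, -S, S), (-S, S, -S), (S, -S, -S)]

def walk(seq, assignment):
    """Points of the unweighted graph of seq for a given base-to-vertex map."""
    pts, pos = [], (0.0, 0.0, 0.0)
    for base in seq:
        pos = tuple(p + v for p, v in zip(pos, assignment[base]))
        pts.append(pos)
    return pts

def walk_distance(seq_a, seq_b, assignment):
    """Euclidean distance over all corresponding points of two graphs."""
    pa, pb = walk(seq_a, assignment), walk(seq_b, assignment)
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(pa, pb)))

seq_a, seq_b = "GATCAGGCTA", "GATTACGCAA"  # hypothetical toy sequences

distances = [
    walk_distance(seq_a, seq_b, dict(zip("AGCT", perm)))
    for perm in permutations(VERTICES)
]
spread = max(distances) - min(distances)
print(len(distances), spread < 1e-9)
```

Every permutation of the vertices of a regular tetrahedron corresponds to a rotation or a rotation combined with an inversion, which is why the distances coincide up to rounding error even though the graphs themselves look different.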



Distance Measure Between Sequences 

We need to define the distance between the resultant 
graphs to evaluate quantitatively the similarities 
between the corresponding sequences. We divided
each sequence into four segments of equal length, and
created a 12-dimensional feature vector from the
coordinate values of four points: the three
boundary points between the segments and the terminal
point. We then defined the distance between the
sequences as the Euclidean distance between the
feature vectors. That is, the square of the distance L
between two sequences is calculated by the following
formula:

$L^2 = \sum_{i=1}^{4} \left[ (x_i - x'_i)^2 + (y_i - y'_i)^2 + (z_i - z'_i)^2 \right],$ (3)

where $x_i$ ($x'_i$), $y_i$ ($y'_i$), and $z_i$ ($z'_i$) are the coordinate
values of the i-th sampling point of the first (second)
sequence; i = 1, 2, 3 corresponds to the three boundary
points of the four segments of the divided sequence,
and i = 4 corresponds to the terminal of the sequence.
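
Eq. (3) can be sketched as follows, assuming that each graph is given as a list of 3D points that excludes the origin. The function names and the integer-division rule for lengths that are not multiples of four are assumptions of this sketch; the text does not specify how such lengths are handled.

```python
import math

def sample_points(points, k=4):
    """Pick the end point of each of k equal-length segments of a graph."""
    n = len(points)
    if n < k:
        raise ValueError("graph is too short to sample")
    # Integer division is an assumed rounding rule for n not divisible by k.
    return [points[(i * n) // k - 1] for i in range(1, k + 1)]

def graph_distance(points_a, points_b):
    """Distance L of Eq. (3) between two graphs."""
    # Flatten the four sampling points into 12-dimensional feature vectors.
    fa = [c for p in sample_points(points_a) for c in p]
    fb = [c for p in sample_points(points_b) for c in p]
    return math.dist(fa, fb)

# Two hypothetical graphs of eight points each, along the x and y axes.
graph_a = [(float(i), 0.0, 0.0) for i in range(1, 9)]
graph_b = [(0.0, float(i), 0.0) for i in range(1, 9)]
print(graph_distance(graph_a, graph_b) ** 2)
```

For these toy graphs the sampling points are the 2nd, 4th, 6th, and 8th points, so the squared distance is 8 + 32 + 72 + 128 = 240 up to rounding.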

We also tried two other choices of sampling points
for calculating the distances: all the
points of the sequence and the terminal point only. The
four-point sampling gave the best performance.
Therefore, we consider only
the four-point sampling in calculating the distances
hereafter.

Results And Discussion 

Data Set 

The nucleotide sequences of the mitochondrial
genomes of 38 mammals were downloaded from
GenBank and used for analysis (TABLE 2).

Graphs And Effects Of Weighting 

Initially, we compared the graphs of closely related
species, common chimpanzee and pygmy
chimpanzee, in FIG. 3 (upper panel). The
appearances of the graphs are very similar. The
lower panel of FIG. 3 shows the same graphs but
without weighting in the numerical conversion of the
sequences. It is evident in this figure that the
weighting emphasizes the characteristics of the
graphs and makes it possible to distinguish between
the graphs of close relatives. Next, we compared the
graphs of distant relatives, dog and common
chimpanzee, in FIG. 4. The
appearances of the graphs are quite different. These
observations support the usefulness of our new
method of graphical representation in terms of
intuitive assessment of sequence similarities.

Phylogenetic Tree 

We calculated the distances between all pairs of 
species listed in TABLE 2 using Eq. (3), and
constructed a distance matrix. FIG. 5 shows the
phylogenetic tree created from the distance matrix.
The tree was drawn with the statistical analysis
software R using the UPGMA (Unweighted Pair Group
Method with Arithmetic Mean) method.
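
For readers unfamiliar with UPGMA, the clustering step can be sketched in a few lines. This is not the R procedure used in the study but a minimal pure-Python illustration; the function name upgma, the Newick output format, and the three distances in the example are assumptions made here and are not results of this work.

```python
def upgma(names, dist):
    """Build a UPGMA tree from a symmetric distance matrix; return Newick text."""
    # cluster id -> (Newick text, number of leaves, height of the node)
    clusters = {i: (names[i], 1, 0.0) for i in range(len(names))}
    # pairwise distances, keyed by (smaller id, larger id)
    d = {(i, j): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        (a, b), dab = min(d.items(), key=lambda kv: kv[1])  # closest pair
        na, sa, ha = clusters.pop(a)
        nb, sb, hb = clusters.pop(b)
        h = dab / 2  # the new node sits halfway between the two clusters
        newick = f"({na}:{h - ha:.3f},{nb}:{h - hb:.3f})"
        # Distance to every other cluster: size-weighted arithmetic mean.
        new_d = {}
        for c in clusters:
            dac = d[(min(a, c), max(a, c))]
            dbc = d[(min(b, c), max(b, c))]
            new_d[(c, next_id)] = (sa * dac + sb * dbc) / (sa + sb)
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        d.update(new_d)
        clusters[next_id] = (newick, sa + sb, h)
        next_id += 1
    return next(iter(clusters.values()))[0] + ";"

# Hypothetical distances, for illustration only.
names = ["human", "chimp", "dog"]
dist = [[0, 2, 10],
        [2, 0, 12],
        [10, 12, 0]]
tree = upgma(names, dist)
print(tree)
```

In the example, human and chimp are merged first at height 1, and dog joins at height 5.5, the mean of its distances to the two (10 and 12) divided by two.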

TABLE 2 LIST OF MITOCHONDRIAL GENOMES OF 38 MAMMALS ANALYZED

Species                        Common name          Accession
Homo sapiens                   Human                V00662
Pan troglodytes                Common chimpanzee    D38113
Pan paniscus                   Pygmy chimpanzee     D38116
Gorilla gorilla                Gorilla              D38114
Pongo pygmaeus                 Orangutan            D38115
Hylobates lar                  Gibbon               X99256
Papio hamadryas                Baboon               Y18001
Equus caballus                 Horse                X79547
Ceratotherium simum            White rhinoceros     Y07726
Rhinoceros unicornis           Indian rhinoceros    X97336
Phoca vitulina                 Harbor seal          X63726
Halichoerus grypus             Gray seal            X72004
Felis catus                    Cat                  U20753
Panthera tigris                Tiger                EF551003
Panthera pardus                Leopard              EF551002
Balaenoptera physalus          Fin whale            X61145
Balaenoptera musculus          Blue whale           X72204
Bos taurus                     Cow                  V00654
Bubalus bubalis                Buffalo              AY488491
Rattus norvegicus              Norway rat           X14848
Mus musculus                   Mouse                V00711
Didelphis virginiana           Opossum              Z29753
Macropus robustus              Wallaroo             Y10524
Ornithorhynchus anatinus       Platypus             X83427
Canis lupus familiaris         Dog                  U96639
Canis lupus chanco             Wolf                 EU442884
Sus scrofa                     Pig                  AJ002189
Ovis aries                     Sheep                AF010406
Loxodonta africana             African elephant     AJ224821
Elephas maximus                Asiatic elephant     DQ316068
Ursus thibetanus mupinensis    Black bear           DQ402478
Ursus arctos                   Brown bear           AF303110
Ursus maritimus                Polar bear           AF303111
Oryctolagus cuniculus          Rabbit               AJ001588
Erinaceus europaeus            Hedgehog             X88898
Microtus kikuchii              Vole                 AF348082
Sciurus vulgaris               Squirrel             AJ238588




FIGURE 3 GRAPHS OF CLOSE RELATIVES WITH (UPPER 
PANEL) AND WITHOUT WEIGHTING (LOWER PANEL). 
SYMBOLS AND '+' SHOW THE SAMPLING POINTS TO 
CALCULATE THE DISTANCE BETWEEN THE SEQUENCES 

FIGURE 4 GRAPHS OF DISTANT RELATIVES: DOG AND
COMMON CHIMPANZEE




www.seipub.org/rbb 



FIGURE 5 PHYLOGENETIC TREE BASED ON THE UPGMA 
METHOD REPRESENTING ALL THE SPECIES ANALYZED 


The configuration of the phylogenetic tree is largely
in agreement with those in (Huang et al. 2011) and
(Yu et al. 2010), with primates, bears, elephants, seals,
and cats (cat, tiger, and leopard) located in their
respective clusters. However, certain species seem to
be placed at inappropriate positions. For example,
the three rodent species (Norway rat, mouse, and vole)
are separated from each other. One of the major causes
of this anomaly seems to be the way the distance
between sequences is defined (see Distance Measure
Between Sequences). In our method, we take four
sampling points of each genome sequence for
calculating the distances. However, the configuration
of these sampling points depends on the start point of
the sequence. Currently, we take the head of the
sequence data as the start point, although the start
point of a mitochondrial genome is not uniquely
determined because of its circular form. We are now
working on improving the definition of the distance
between sequences to take the start point into account.
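
The dependence on the start point can be illustrated with a small sketch. It is an illustration under stated assumptions: the vectors are unweighted for brevity, the placement of 'A' and 'G' on the tetrahedron is assumed, and the circular sequence is a hypothetical toy example. The same circular sequence is read from two different start points and compared with the four-point distance.

```python
import math

S = 1 / math.sqrt(3)
# Placement of A and G is an assumption; weights are omitted for brevity.
VECTORS = {"A": (S, S, S), "G": (-S, -S, S), "T": (-S, S, -S), "C": (S, -S, -S)}

def walk(seq):
    """Points of the unweighted graph of seq, excluding the origin."""
    pts, pos = [], (0.0, 0.0, 0.0)
    for base in seq:
        pos = tuple(p + v for p, v in zip(pos, VECTORS[base]))
        pts.append(pos)
    return pts

def four_point_distance(seq_a, seq_b):
    """Euclidean distance between the four-point feature vectors."""
    def features(seq):
        pts = walk(seq)
        n = len(pts)
        return [c for i in range(1, 5) for c in pts[(i * n) // 4 - 1]]
    return math.dist(features(seq_a), features(seq_b))

genome = "AAAAGGGGCCCCTTTT"            # hypothetical circular sequence
rotated = genome[4:] + genome[:4]      # same circle, different start point

print(four_point_distance(genome, genome))
print(four_point_distance(genome, rotated) > 0)
```

Both readings end at the same terminal point because they contain the same bases, but the three boundary points differ, so the distance between two readings of one and the same circular genome is not zero.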

FIG. 6 shows the graphs of several clusters of species, 
which are closely located in the phylogenetic tree (FIG. 
5). The appearances of the graphs in each cluster look 
similar, whereas those of the graphs between different 
clusters are highly dissimilar. This observation 
confirms that the phylogenetic tree properly reflects 
the similarities between the graphs. 
FIGURE 6 GRAPHS OF THE SPECIES CLOSELY LOCATED IN
THE PHYLOGENETIC TREE (PANEL LABELS RECOVERED: VOLE,
WHITE RHINOCEROS, AFRICAN ELEPHANT)


Conclusion 

We proposed a novel method for graphical 
representation of DNA sequences. In this method, we
assigned three-dimensional vectors represented by the 
vertices of a regular tetrahedron to each base, and 
gave weighting to the vectors based on the self- 
information of the appearance of the corresponding 
bases. Our method has a significant feature in that the 
quantitative outcomes with respect to sequence 
similarities are independent of the arrangement of the 
vectors due to its symmetric nature. 

By comparing the graphs of close and distant relatives, 
we confirmed the effects of weighting and the 
usefulness of our method in terms of the intuitive 
assessment of sequence similarities. Furthermore, we 
defined the distance between graphs to evaluate 
sequence similarities quantitatively, and constructed a 
distance matrix including all the species analyzed to 
create the phylogenetic tree based on the distance 
matrix with the UPGMA method. We grouped the
species located close to each other in the
phylogenetic tree into clusters,
and compared the appearances of the corresponding
graphs within and between the clusters. The
appearances of the graphs of the species in each
cluster are similar to each other, whereas those
between different clusters are dissimilar. We therefore
conclude that our method is effective for evaluating
sequence similarities on an intuitive basis. However,
our distance measure requires some refinement, since
certain species were located at incorrect positions
in the phylogenetic tree. We are now improving the
definition of the distance between sequences in terms
of identifying the appropriate start point of
mitochondrial genome sequences.